<html><head><link href="./screen.css" rel="stylesheet" type="text/css" />
          <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
          </script>
          <script type="text/javascript">
          _uacct = "UA-3418876-1";
          urchinTracker();
          </script>
        </head><body><div id="top"><div id="main_navigation"><ul><li>Documentation</li><li><a href="contribute.html">Contribute</a></li><li><a href="index.html">Home</a></li></ul></div></div><div id="middle"><div id="main_content"><div id="secondary_navigation"><ul><li><a href="syntactic_recognition.html">Syntax</a></li><li><a href="semantic_interpretation.html">Semantics</a></li><li><a href="using_in_ruby.html">Using In Ruby</a></li><li>Advanced Techniques</li></ul></div><div id="documentation_content"><h1>Pitfalls</h1>

<h2>Left Recursion</h2>

<p>An weakness shared by all recursive descent parsers is the inability to parse left-recursive rules. Consider the following rule:</p>

<pre><code>rule left_recursive
  left_recursive 'a' / 'a'
end
</code></pre>

<p>Logically it should match a list of 'a' characters. But it never consumes anything, because attempting to recognize <code>left_recursive</code> begins by attempting to recognize <code>left_recursive</code>, and so goes an infinite recursion. There's always a way to eliminate these types of structures from your grammar. There's a mechanistic transformation called <em>left factorization</em> that can eliminate it, but it isn't always pretty, especially in combination with automatically constructed syntax trees. So far, I have found more thoughtful ways around the problem. For instance, in the interpreter example I interpret inherently left-recursive function application right recursively in syntax, then correct the directionality in my semantic interpretation. You may have to be clever.</p>

<h1>Advanced Techniques</h1>

<p>Here are a few interesting problems I've encountered. I figure sharing them may give you insight into how these types of issues are addressed with the tools of parsing expressions.</p>

<h2>Matching a String</h2>

<pre><code>rule string
  '"' (!'"' . / '\"')* '"'
end
</code></pre>

<p>This expression says: Match a quote, then zero or more of any character but a quote or an escaped quote followed by a quote. Lookahead assertions are essential for these types of problems.</p>

<h2>Matching Nested Structures With Non-Unique Delimeters</h2>

<p>Say I want to parse a diabolical wiki syntax in which the following interpretations apply.</p>

<pre><code>** *hello* ** --&gt; &lt;strong&gt;&lt;em&gt;hello&lt;/em&gt;&lt;/strong&gt;
* **hello** * --&gt; &lt;em&gt;&lt;strong&gt;hello&lt;/strong&gt;&lt;/em&gt;

rule strong
  '**' (em / !'*' . / '\*')+ '**'
end

rule em
  '**' (strong / !'*' . / '\*')+ '**'    
end
</code></pre>

<p>Emphasized text is allowed within strong text by virtue of <code>em</code> being the first alternative. Since <code>em</code> will only successfully parse if a matching <code>*</code> is found, it is permitted, but other than that, no <code>*</code> characters are allowed unless they are escaped.</p>

<h2>Matching a Keyword But Not Words Prefixed Therewith</h2>

<p>Say I want to consider a given string a characters only when it occurs in isolation. Lets use the <code>end</code> keyword as an example. We don't want the prefix of <code>'enders_game'</code> to be considered a keyword. A naiive implementation might be the following.</p>

<pre><code>rule end_keyword
  'end' &amp;space
end
</code></pre>

<p>This says that <code>'end'</code> must be followed by a space, but this space is not consumed as part of the matching of <code>keyword</code>. This works in most cases, but is actually incorrect. What if <code>end</code> occurs at the end of the buffer? In that case, it occurs in isolation but will not match the above expression. What we really mean is that <code>'end'</code> cannot be followed by a <em>non-space</em> character.</p>

<pre><code>rule end_keyword
  'end' !(!' ' .)
end
</code></pre>

<p>In general, when the syntax gets tough, it helps to focus on what you really mean. A keyword is a character not followed by another character that isn't a space.</p></div></div></div><div id="bottom"></div></body></html>